【Day 13】- 用 JSON 儲存爬來的 PTT 文章。(實戰 PTT 爬蟲 3/3)

2021 iThome 鐵人賽

DAY 13

AI & Data

網路爬蟲，萬物皆爬 - 30 天搞懂並實戰網路爬蟲及應對反爬蟲技術系列第 13 篇

13th鐵人賽

Vincent55

團隊肝已經，死了

2021-09-28 21:31:46

3192 瀏覽

分享至

前情提要

前一篇文章帶大家寫了能爬取持續爬取 PTT 文章的爬蟲。

開始之前

本篇將繼續帶各位寫 PTT 爬蟲，今天會將爬取到的文章內容用 JSON 檔案儲存起來。

預期效果

將爬取到的文章內容儲存於 JSON 檔案中。

實作

我們先定義一個存放所有文章資訊的串列

article_list = []

之後，我們將每一篇文章資訊存為一個字典，並將這個字典加入存放所有文章資訊的串列內。

title = art.find('div', class_='title').getText().strip()
if not title.startswith('(本文已被刪除)'):
    link = 'https://www.ptt.cc' + \
        art.find('div', class_='title').a['href'].strip()
author = art.find('div', class_='author').getText().strip()
article = {
    'title': title,
    'link': link,
    'author': author
}

之後讀者可以將存放所有文章資訊的串列輸出看是否正常，這邊統整一下目前的程式碼。

import requests
from bs4 import BeautifulSoup

article_list = []

def get_resp(url):
    cookies = {
        'over18': '1'
    }
    resp = requests.get(url, cookies=cookies)
    if resp.status_code != 200:
        return 'error'
    else:
        return resp

def get_articles(resp):
    soup = BeautifulSoup(resp.text, 'html5lib')
    arts = soup.find_all('div', class_='r-ent')
    for art in arts:
        title = art.find('div', class_='title').getText().strip()
        if not title.startswith('(本文已被刪除)'):
            link = 'https://www.ptt.cc' + \
                art.find('div', class_='title').a['href'].strip()
        author = art.find('div', class_='author').getText().strip()
        article = {
            'title': title,
            'link': link,
            'author': author
        }
        article_list.append(article)
        # print(f'title: {title}\nlink: {link}\nauthor: {author}')
    # 利用 Css Selector 定位下一頁網址
    next_url = 'https://www.ptt.cc' + \
        soup.select_one(
            '#action-bar-container > div > div.btn-group.btn-group-paging > a:nth-child(2)')['href']
    return next_url

# 當執行此程式時成立
if __name__ == '__main__':
    # 第一個頁面網址
    url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
    # 先讓爬蟲爬 10 頁
    for now_page_number in range(10):
        print(f'crawing {url}')
        resp = get_resp(url)
        if resp != 'error':
            url = get_articles(resp)
        print(f'======={now_page_number+1}/10=======')
    print(article_list)

接下來，要將存放所有文章資訊的串列存為 JSON 檔案，我們使用的是 Python 中的 json 庫(內建)，記得將 json 引入。

import json

with open('ptt-articles.json', 'w', encoding='utf-8') as f:
    json.dump(article_list, f, indent=2,
              sort_keys=True, ensure_ascii=False)

再來將存為 JSON 檔案的程式碼加入爬蟲專案當中。

import requests
import json
from bs4 import BeautifulSoup

article_list = []

def get_resp(url):
    cookies = {
        'over18': '1'
    }
    resp = requests.get(url, cookies=cookies)
    if resp.status_code != 200:
        return 'error'
    else:
        return resp

def get_articles(resp):
    soup = BeautifulSoup(resp.text, 'html5lib')
    arts = soup.find_all('div', class_='r-ent')
    for art in arts:
        title = art.find('div', class_='title').getText().strip()
        if not title.startswith('(本文已被刪除)'):
            link = 'https://www.ptt.cc' + \
                art.find('div', class_='title').a['href'].strip()
        author = art.find('div', class_='author').getText().strip()
        article = {
            'title': title,
            'link': link,
            'author': author
        }
        article_list.append(article)
    # 利用 Css Selector 定位下一頁網址
    next_url = 'https://www.ptt.cc' + \
        soup.select_one(
            '#action-bar-container > div > div.btn-group.btn-group-paging > a:nth-child(2)')['href']
    return next_url

# 當執行此程式時成立
if __name__ == '__main__':
    # 第一個頁面網址
    url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
    # 先讓爬蟲爬 10 頁
    for now_page_number in range(10):
        print(f'crawing {url}')
        resp = get_resp(url)
        if resp != 'error':
            url = get_articles(resp)
        print(f'======={now_page_number+1}/10=======')
    # 將存放所有文章資訊的串列存於 JSON 檔案中
    with open('ptt-articles.json', 'w', encoding='utf-8') as f:
        json.dump(article_list, f, indent=2,
                  sort_keys=True, ensure_ascii=False)

成功爬取連續多頁面 ptt 文章資訊並存於 JSON 檔中。